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Abstract 

This  note  attempts  to  give  a  detailed  error  analysis  of  Reciproot 
Algorithm  proposed  by  Kahan  and  Ng  in  1986.  It  is  showed  that 
the  algorithm  yields  correctly  rounded  square  root  under  all  rounding 
modes. 


1  Initial  Approximation 

Let  Xq  and  Xi  be  the  leading  and  the  trailing  32-bit  words  of  a  floating  point 
number  x  (in  IEEE  double  format)  respectively 
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x: 


1  11 


52 


/ 


Xq: 


1  11 


20 


s 


e 


/i 


x^: 


32 

/2 


By  performing  shifts  and  subtracts  on  Xq  and  j/o  (both  regarded  as  integers), 
we  obtain  a  7-bit  approximation  of  l/y^  as  follows. 


k  :=  0x5f  eSOOOO  —  (xo  1); 
j/o  :=  k  —  r2[63&(A;  14)]. 

Here  A;  is  a  32-bit  integer  and  T2[-]  is  an  integer  array  containing  correction 
terms.  Now  magically  the  floating  value  of  y  (j/’s  leading  32-bit  word  is  j/o, 
the  value  of  its  trailing  word  j/i  is  set  to  zero)  approximates  Ij y/x  to  more 
than  7-bit. 

Value  of  r2[64]  are 

0x1500,  0x2ef8,  0x4d67,  0x6b02,  0x87be,  0xa395,  0xbe7a,  0xd866, 
0xfl4a,  0xl091b, Oxllf cd, 0x13552, 0x14999, 0xl5c98,0xl6e34,0xl7e5f, 
0xl8d03 , 0xl9a01 , 0xla545 , 0xlae8a, 0xlb5c4 , OxlbbOl , Oxlbf de , 0xlc28d , 
0xlc2de , OxlcOdb , 0xlba73 , Oxlbllc , 0xla4b5 , 0xl953d , 0x18266 , 0xl6be0 , 
0xl683e,0xl79d8,0xl8a4d, 0x19992, 0xla789,0xlb445, Oxlbf 61, 0xlc989, 
0xldl6d , 0xld77b , Oxldddf , 0xle2ad , 0xle5bf , 0xle6e8 , 0xle654 , 0xle3cd , 
Oxldf 2a, 0xld635 , 0xlcbl6 , 0xlbe2c , 0xlae4e , 0xl9bde , 0xl868e , 0xl6e2e , 
0xl527f , 0xl334a, 0x11051 , 0xe951 ,  OxbeOl,  0x8e0d,  0x5924,  Oxledd. 


Now  we  prove  our  claimed  accuracy  of  y  as  an  initial  approximation  to 
Ij ^/x.  As  a  matter  of  fact,  x  can  be  written  as  x  =  (  — 1)®2®  X  1./  =  2®  X  1./. 
Tedious  analysis  shows  that 

k  ~  2~~  ^  ^1  +  1 - ,  if  e  is  odd,  (1) 

k  ~  2“2“1  /"l  -| - 2~^')  ’  ®  even,  (2) 


This 

table  needs  to 
he  checked! 
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where  f[  is  fi  except  its  last  bit  is  set  to  zero. 

Case  (i):  e  is  odd  and  bits  of  f[  are  not  all  zero. 

Denote  a  =  l.f[.  Since  1  <  a  <  2  —  2“^®,  we  have  2“^°  <  1  —  f  <  2“^. 
Assume  for  the  moment  that  a  >  1,  then  1  —  ^  <  2“^.  Write 


1 


a 

2 


0.00203040506  •  •  •  «21- 


63&(A;  14)  =  O02O3O4O5O6  which  gives  us  valuable  information  to  construct 

correction  terms.  Let 


t  =  O.O02O3O4O5O6,  e  =  O.OOOOOI2  =  2 
Now  we  know  that  t<l  —  j<t-he,  which  produces 

2(1  -t)  -2e  <  a  <  2(1  -t). 

(if  1  =  0,  we  have  2(1  —  e)  <  a  <  2.) 

We  are  required  to  choose  a  number  d  associated  with  those  information 
so  that 

a  .  2 

1  +  1  —  —  —  d  approximates  — ^===  accurately  enough. 

As  this  approximation  could  not  be  correct  to  more  than  20-bits,  we  can 
restate  it  as 

a  ■  2 

1  +  1  —  —  —  a  approximates  —  accurately  enough. 
Elementary  arguments  show  that  the  best  d  is 


1  +  1  +  e  — 


d  = 


V2(2(i-t-q) 


+  1  +  1  — 


V2(2(l-0) 


1  +  1  +  e  — 


yi-t-e 


+  1  +  1  — 


e  1  /  1  1 

—  +  /.  + 


2  2  \  VI  -1  - 
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except  possibly  for  one  case  when  the  interval  [2(1  —  t  —  e),  2(1  —  1)]  contains 
\/2,  for  which  t  =  0.0101  II2  =  0.359375.  Individual  check  shows  that  this 
gives  no  trouble.  To  hnd  the  largest  error  in  using  1  +  1  —  ^  —  d  to  approximate 
-3=,  it  suffices  for  us  to  look  at 


1  A  t  —  d .  ^ 

^2(2(1-!)) 

e  1  /  1  1  \ 

2  ^  2  -1-e 

_  e  1 

2\  [1  —  t  —  e)Vl  —  t  +  (1  —  t)Vl  —  t  —  e 


whose  absolute  value  for  all  possible  0  <  1  <  2  ^  is  less  than  or  equal  to 
f  X  0.4941  <  I  =  2-®. 

Case  (ii):  e  is  even  and  bits  of  f[  are  not  all  zero. 

Denote  a  =  l.f[.  Since  1  <  a  <  2  —  2“^®,  we  have  2“^  +  2“^°  <  <  1. 

Assume  for  the  moment  that  a  >  1,  then  <  1.  Write 

63&(A;  ;+  14)  =  10203040506  which  gives  us  valuable  information  to  construct 
correction  terms.  Let 


t  =  O.I02O3O4O5O6,  e  =  O.OOOOOI2  =  2  ®. 
Now  we  know  that  t  <  t  +  e,  which  produces 

3  -  2(1  +  e)  <  a  <  3  -  2+ 


We  are  required  to  choose  a  number  d  associated  with  those  information 
so  that 


1  + 


3  —  a 


2 

At 


2 


d  approximates 


accurately  enough. 
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As  this  approximation  could  not  be  correct  to  more  than  20-bits,  we  can 
restate  it  as 


1  + 


3  —  a 
2 


d  approximates 


2 

/a 


accurately  enough. 


Elementary  arguments  show  that  the  best  d  is 


1  +  t  +  e  — 


d  = 


+  1  +  t  — 


—  ^  +  t  +  -  — 


+ 


2  yy/3-2t  A3  -  2(t  +  e) /  ’ 


except  possibly  for  one  case  when  the  interval  [3  —  2{t  +  e),  3  —  2t]  contains 
for  which  t  =  O.IOIIOI2  =  0.703125.  Individual  check  shows  that  this 
gives  no  trouble.  To  hud  the  largest  error  in  using  —  d  to  approximate 
-dj=,  it  suffices  for  us  to  look  at 

^/r>'  * 


1  +  t 


d- 


V3^^ 


e 

'X  + 


2  V^3-2(f +  e)  V3  -  2t^ 


1  - 


(3  -  2{t  +  e))V3^^+  (3  -  2t)^3  -  2{t  +  e) 


1 


whose  absolute  value  for  all  possible  O.I2  <  t  <  O.IIIIII2  is  less  than  or 
equal  to  f  X  0.9543 

Case  (iii):  bits  of  f[  are  all  zero. 

In  this  case,  a  =  1.  Now  if  e  is  odd,  then  t  =  O.IOOOOO2.  The  correspond¬ 
ing  d  is  gotten  as  in  Case  (ii).  Computation  shows  the  error  cannot  exceed 
0.003502  2-®-!®^®. 

If  e  is  even,  then  t  =  O.OOOOOO2.  The  corresponding  d  is  gotten  as  in  Case 
(i).  Computation  shows  the  error  cannot  exceed  0.00386  ~ 

Now  we  reach  the  following  conclusion:  The  initial  guess  gives  up  to  7 
correct  bits  or  more  if  e  is  even;  while  up  to  8  correct  bits  or  more  if  e  is  odd. 


Reciproot  Algorithm — Correctly  Rounded? 


6 


2  Iteration  Refinement 

Apply  Reciproot  iteration  three  times  to  y  and  multiply  the  result  by  x  to 
get  an  approximation  2;  that  matches  y/x  to  about  1  nip.  To  be  exact,  we 


will  have 

—  1.0654  ulp  <  2;  —  a/x  <  1  ulp.  (3) 

Set  rounding  mode  to  Round-to-nearest  and  sequentially  do 

y  :=  y{l.h-t).hxy^),  (4) 

y  :=  j/((1.5- 2-4°) -0.5V),  (5) 

2;  :=  xy,  (6) 

2;  :=  z -\- tlAz{l  —  zy).  (7) 


To  analyze  the  accuracy  of  y  and  2;  after  each  step,  without  loss  of  gener¬ 
ality,  we  assume  1  <  x  <  4.  (x  =  1  or  4  can  be  checked  individually.)  Then 
1  <  ^/x  <  2  and  1  >  -^  >  O.R. 

•  After  initial  approximation  and  before  (4):  y  can  be  written  as 

1 

?/  =  ^  +  e, 

yx 

where  |e|  <  2-®-°°’^485  if  l  <  a;  <  2;  |e|  <  2“°  if  2  <  x  <  4. 

•  After  (4)  and  before  (5):  y  can  be  written  as 

y  = 

where  e'  is  for  rounding  errors.  Note  that  if  1  <  x  <  2,  1.5e^-V  + 
O.be^x  <  2“^®-°'^’^'^’^  and  if  2  <  x  <  4,  1.5e'^^/x  +  O.be^x  <  As 

rounding  error  at  this  stage  are  negligible  unless  e  ~  2  we  conclude 
that:  |ei|  <  if  1  <  x  <  2;  and  |ei|  <  2-4®-'^  if  2  <  x  <  4.  Further 

more  Ci  <  0  unless  it  is  of  order  2“®^. 


—  1.5e^-V"  ~  0.5e°a;  +  e 


—j=  +  ^1, 
'X 
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•  After  (5)  and  before  (6):  y  can  be  written  as 


y 


_  2-40  /  I  ^  \  _  i.5e2^_  0.5e?x  +  e" 

\/x  Vv®  / 

f 

~y=  —  ^2, 

\/x 


where  e"  is  for  rounding  errors.  Note  that  if  f  <  x  <  2,  f.5e^-^/ic  + 
O.befx  <  2-^o-oo9967  if  2  <  x  <  4,  lAC^/x  +  O.be^x  <  2-®^'^4_ 
rounding  error  at  this  stage  are  negligible,  we  conclude  that:  2-44  ^ 
£2  <  2-29-0090  if  1  <  X  <  2;  and  2-44  <  <  2-32-23  [{  2  <  x  <  4. 


•  After  (6)  and  before  (7):  2;  =  fl[xy)  =  ydr  —  €2X-\-€jm  where  |em|  <  2 
i.e.,  at  most  |  ulp  with  respect  to  1. 

•  Computations  in  (7):  fl{zy)  <  1  and 

fl(zy)  =  1  —  2e2^/x  +  H - ^  +  (neg.  terms), 

yx 

where  \c'^\  <  2-^4^  i.e.,  at  most  |  ulp  with  respect  to  1.  From  now  on 
“(neg.  terms)”  refers  to  some  negligible  terms  in  comparing  with  the 
unit  in  the  last  place  of  a  corresponding  expression.  No  rounding  error 
in  calculating 

1  —  fl(zy)  =  2e2\fx  —  e\x - ^  +  (neg.  terms). 

y  X 

Note 


//(0.52;(1  -  zy)) 

=  {yfx  -  t2X  +  e„)  ^e2^/x  -  ^  ^  +  (4ieg.  terms) 

Cm  c 

=  e2X  —  l.be^xyCc - ^ - (44eg-  terms), 

where  \e"^\  <  2-33|e2a;|.  Now 
fl{z  +  0.52;(1  -  zy)) 
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=  Cx  —  e2X  +  Cm  +  £2^  —  lAenxCx - - - 


+‘^m  +  e'm  +  (neg.  terms) 


=  a/x  —  1.5e^x^/x  +  “^ - 2^^^  (^®g-  terms) 


=  +  Vi 

where  \c'^\  <  2“®^,  i.e.,  at  most  |  ulp  with  respect  to  1.  Note 


- V  X  + 

2  2 


<  ^  ulp  +  i  ulp  +  i  ulp  =  1  ulp 


with  respect  to  1.  On  the  other  hand,  0  >  — >  —0.0654  ulp  if 
1  <  X  <  2,  and  0  >  —lAelx^/x  >  —0.0021  ulp  if  2  <  x  <  4.  Therefore 
we  have 

—  1.0654  ulp  <  ri  <  lulp. 


We  have  just  proved  (3). 

3  Final  Adjustment 

By  twiddling  the  last  bit  of  2;  it  is  possible  to  force  2;  to  be  correctly  rounded 
according  to  the  prevailing  rounding  mode  as  follows.  Let  r  and  i  be  copies  of 
the  rounding  mode  and  inexact  flag  before  entering  the  square  root  program. 
Also  we  use  the  expression  2;±  ulp  for  the  next  representable  floating  numbers 
(up  and  down)  of  2;. 

•  Case  RN — round-to-nearest\  In  this  case,  if  2;  —  ^/x  >  |ulp,  then  do 
z  =  z  —  ulp;  if  2;  —  yCr  <  —  \  ulp,  then  do  2;  =  2;  +  ulp;  otherwise  2;  is 
correctly  rounded  already. 

Set  rounding  mode  to  round-toward-zero  which  means  “chopped”. 

We  write  2;  =  ^/x  Arj  =  y?x  +  |  ulp  +  e  where  —1.564  ulp  <  e  <  |  ulp. 
Note 

2;(2;  —  ulp)  =  z'^  —  z  X  ulp 
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=  X  A  2riyd  +  T]^  —  \fx  X  ulp  —  T]  X  ulp 
=  X  -\- 


=  X  -\- 


x  +  (  iulp  +  (^e  -  iulp 

x  +  - ulp^. 


So  z{z  —  ulp)  >  X  if  and  only  if 


e  > 


iulp^ 


< 


ulp^ 


(8) 

di"  + +  I  ulp"^  »a/® 

As  computations  are  supposed  to  be  done  under  round-toward-zero^  we 
have  fl{z{z  —  ulp))  >  x  if  and  only  if  (8)  holds. 

Theorem  1  Th  ere  is  no  IEEE  double  precision  floating  point  number 
X  such  that  ^ 

X  =  a/x  +  -  ulp  +  e 

is  an  IEEE  double  precision  floating  point  number  for  some 

ulp^ 

0  <  e  <  ^ 


On  the  other  hand,  we  write  x  =  \/x  —  |  ulp  —  e  where  —1.5  ulp  <  e  < 
0.5654 ulp.  Note 

z{z  +  ulp)  =  +  X  X  ulp 

=  X  A  2gyCc  +  +  \/x  x  ulp  +  g  x  ulp 


=  X  — 


=  X  — 


X-  (  iulp  +  ^ 

X  - ulp^. 


So  z[z  +  ulp)  >  x  if  and  only  if 


e  < 


iulp^ 


< 


ulp^ 


(9) 

Ac  A  \J X  A 

As  computations  are  supposed  to  be  done  under  round-toward-zero^  we 
have  fl{z{z  +  ulp))  >  x  if  and  only  if  (9)  holds. 
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Theorem  2  Th  ere  is  no  IEEE  double  precision  floating  point  number 
X  such  that 


z  =  \/x - ulp  —  e 


is  an  IEEE  double  precision  floating  point  number  for  some 

ulp2 

0  <  e  <  ^ 


So  the  following  adjustment 

round-to-nearest 
chopped)  z  =  z  -  ulp;  else 
chopped)  z  =  z;  else  z  =  z+ulp. 


case  RN :  . . . 

if (x<=  z*(z-ulp) . . . 
if (x<=  z* (z+ulp) . . . 


will  produce  a  c  with 

\z  —  yfr\  <  -  ulp 

Proofs  of  the  above  theorems  will  be  given  in  the  next  section  at  the 
end  of  this  note. 

•  Case  RZ  or  Case  RM — round-to-zero  or  round-to — oo:  In  this  case,  if 
c  >  ydc,  then  do  z  =  z  —  ulp;  if  c  —  ^/x  <  —  ulp,  then  do  z  =  z  -\-  ulp; 
otherwise  c  is  correctly  rounded  already. 

Reset  rounding  mode  to  round-to-+oo. 

Note  z^  >  X  ii  and  only  if  c  >  ydc,  which  also  holds  in  floating  point 
operations  under  RP.  (c  +  ulp)2  <  X  if  and  only  if  z  —  ydr  <  —  ulp, 
which  also  holds  in  floating  point  operations  under  RP. 

So  the  following  adjustment 

case  RZrcase  RM:  ...  round-to-zero 

or  round-to-negative  infinity 
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if(x<z*z  ...  rounded  up)  z  =  z  -  ulp;  else 
if (x>= (z+ulp) * (z+ulp)  ...rounded  up)  z  =  z+ulp. 


will  yield  correctly  rounded  result. 

•  Case  RP — round-to-d-oo.  In  this  case,  if  —1  ulp  <  —  \/x  <  0,  then  do 

z  =  z  ulp;  else  if  z  —  y/x  <  —  ulp,  then  do  2;  =  2;  +  2  ulp;  otherwise 
2;  is  correctly  rounded  already. 

Under  rounding  mode — round-to-zero. 

Note  if  [z  +  ulp)^  <  r  if  and  only  if  2;  —  yd  <  —  ulp,  which  also  holds 
in  floating  point  operations  under  RZ.  [z  +  ulp)2  <  r  <  2;^  if  and  only 
if  —1  ulp  <  2;  —  yd  <  0,  which  also  holds  in  floating  point  operations 
under  RZ. 

So  the  following  adjustment 

case  RP:  ...  round-to-positive  infinity 

if (x> (z+ulp) * (z+ulp) ...  chopped)  z  =  z+2*ulp;  else 
if(x>z*z  ...chopped)  z  =  z+ulp. 


will  yield  correctly  rounded  result. 

To  determine  whether  2;  is  an  exact  square  root  of  r,  we  notice  an  nec¬ 
essary  condition  for  2;  to  be  an  exact  square  root  of  x  is  that  the  training 
26  bits  of  2;  must  be  zero.  So  if  the  training  26  bits  of  2;  is  not  zero,  raise 
Inexact  flag;  else  if  e  is  odd  and  the  26th  bit  of  2;  is  1  then  2;  is  not  exact; 
else  if  z"^  ^  X  (at  this  moment  fl{z‘^)  =  2;^),  then  2;  is  not  exact;  otherwise  2; 
is  exact. 
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4  Proofs  of  Theorems  1  and  2 


Let  us  prove  Theorem  1  first.  Assume  to  the  contrary,  there  were  an  IEEE 
double  precision  floating  point  number  x  as  described  in  the  theorem.  By 
scaling  x  and  2;  properly,  we  may  assume  that  1  <  x  <  4.  Note 

X  =  ~  2  ~ 

=  —  z  ulp  +  -  ulp^  —  2e2;  +  e  ulp  +  C .  (10) 


As  now,  ulp  =  2“®^,  and  thus  |  ulp^  =  2“^°®.  It  is  easy  to  see  that  in  binary 
form 

Z^  —  Z  ulp  =  0100.0-10-2  •  •  •  0-104, 

where  o^’s  are  either  1  or  0  and  Oi,  Oq  cannot  be  0  at  the  same  time.  Therefore 

Z^  —  Z  ulp  +  —  ulp^  =  O1O0.O-1O-2  •  •  •  O-104OI, 
which  proves  e  could  not  be  0.  If,  however,  e  >  0,  then 

0  >  — 2e2;  +  +  e ulp  =  e(— 2^;  +  e  +  ulp)  >  — 2e2;  > - ulp^. 


Thus  the  binary  expansion  of  2;^  —  2;  ulp  +  e  ulp  +  |  ulp  —  2e2;  +  C  could  not 
match  that  of  x,  contradicting  (10).  Theorem  1  is  proved. 

To  prove  Theorem  2,  we  apply  similar  trick  as  we  just  did.  Suppose  we 
had  such  an  IEEE  double  precision  floating  point  number  x  as  described  in 
the  theorem.  Without  loss  of  generality,  we  may  assume  1  <  x  <  4.  Note 

X  =  (2;+iulp  +  e)^ 

=  z^  A  z  ulp  T  —  ulp  T  2cz  T  e  ulp  +  (11) 


As  now,  ulp  =  2“®^,  and  thus  |  ulp^  =  2“^°®.  It  is  easy  to  see  that  in  binary 
form 

2;^  +  2;  ulp  =  0100.0-10-2  •  •  •  0-104, 
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where  a^’s  are  either  1  or  0  and  Oi,  Oq  cannot  be  0  at  the  same  time.  Since 
0  <  -  ulp^  +  2ez  +  e  ulp  + 

the  binary  expansion  of  A  ^  ulp  +  |  ulp^  +  2e2;  +  e  ulp  +  could  not  match 
that  of  X,  contradicting  (11).  Theorem  2  is  proved. 
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